import pandas as pd
import numpy as np
import sqlite3
from lets_plot import *
import statsmodels.formula.api as smf
LetsPlot.setup_html(isolated_frame=True)
sqlite_file = 'lahman_1871-2022.sqlite'
con = sqlite3.connect(sqlite_file)
tables = pd.read_sql_query("SELECT name FROM sqlite_master WHERE type='table';", con)
This section of the analysis cleans the data and visualizes trends.
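The query that follows counts postseason wins by stacking the winner and the loser of each series with UNION ALL and then summing per team and year. A minimal sketch of that idea on a tiny in-memory table; the rows and the `con_demo` connection are invented for illustration and are not taken from the Lahman database:

```python
import sqlite3

# Toy stand-in for SeriesPost; the rows are invented for illustration only.
con_demo = sqlite3.connect(":memory:")
con_demo.execute(
    "CREATE TABLE SeriesPost (yearID INT, round TEXT, teamIDwinner TEXT, "
    "teamIDloser TEXT, wins INT, losses INT)"
)
con_demo.executemany(
    "INSERT INTO SeriesPost VALUES (?, ?, ?, ?, ?, ?)",
    [
        (2000, "ALCS", "AAA", "BBB", 4, 2),  # AAA beat BBB 4 games to 2
        (2000, "WS", "AAA", "CCC", 4, 1),    # AAA beat CCC 4 games to 1
    ],
)

# `wins` counts the series winner's wins; `losses` counts the winner's losses,
# i.e. the games the loser won. Stacking both gives one row per team per series.
rows = con_demo.execute(
    """
    WITH ps AS (
        SELECT yearID, teamIDwinner AS teamID, wins AS gwins FROM SeriesPost
        UNION ALL
        SELECT yearID, teamIDloser AS teamID, losses AS gwins FROM SeriesPost
    )
    SELECT teamID, SUM(gwins) AS postseason_wins
    FROM ps
    GROUP BY yearID, teamID
    ORDER BY teamID
    """
).fetchall()
postseason_wins = dict(rows)
print(postseason_wins)
con_demo.close()
```

UNION ALL (rather than UNION) matters here: a plain UNION would drop duplicate rows, which could silently discard a series if a team had two series with identical win counts in the same year.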
# Perform a query that returns the total salary, total wins, total losses, per year and team
base = pd.read_sql_query("""
WITH ps AS (
-- collect game wins for both winners and losers of each postseason series
SELECT yearID, teamIDwinner AS teamID, wins AS gwins FROM SeriesPost
UNION ALL
SELECT yearID, teamIDloser AS teamID, losses AS gwins FROM SeriesPost
),
ps_agg AS (
SELECT yearID, teamID, SUM(gwins) AS postseason_wins
FROM ps
GROUP BY yearID, teamID
),
ws AS (
SELECT yearID, teamIDwinner AS teamID, 1 AS won_ws
FROM SeriesPost
WHERE round = 'WS'
)
SELECT
t.yearID,
t.name,
SUM(s.salary) AS total_salary,
t.W AS wins,
t.L AS losses,
COALESCE(ps_agg.postseason_wins, 0) AS postseason_wins, -- total postseason wins (all rounds)
COALESCE(ws.won_ws, 0) AS won_world_series -- 1 if WS champ, else 0
FROM teams t
JOIN salaries s
ON t.teamID = s.teamID
AND t.yearID = s.yearID
LEFT JOIN ps_agg
ON ps_agg.teamID = t.teamID
AND ps_agg.yearID = t.yearID
LEFT JOIN ws
ON ws.teamID = t.teamID
AND ws.yearID = t.yearID
GROUP BY t.teamID, t.yearID, t.name, t.W, t.L, ps_agg.postseason_wins, ws.won_ws
ORDER BY t.yearID;
""", con)
# Calculate a wins per million dollars expended column
base['wins_per_million'] = base['wins'] / (base['total_salary'] / 1_000_000)
# Change won_world_series to boolean
base['won_world_series'] = base['won_world_series'].astype('bool')
# Convert Salary to millions
base['total_salary'] = base['total_salary'] / 1_000_000
The histogram below shows a clearly right-skewed distribution, as one would expect when money is involved. A relatively small number of very high payrolls sit in the long right tail and pull the mean salary expenditure above what the typical team spends.
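One way to put a number on the right skew is to compare the mean with the median and compute a skewness coefficient. The sketch below uses made-up payroll figures (the `payrolls` list is hypothetical) and the simple population skewness formula, so it illustrates the check rather than reproducing the actual data:

```python
import statistics

# Hypothetical payrolls in millions of dollars, invented for illustration.
payrolls = [20, 25, 30, 35, 40, 45, 50, 60, 90, 150, 220]

mean_payroll = statistics.mean(payrolls)
median_payroll = statistics.median(payrolls)

# Population skewness: the average cubed z-score. Positive means a long right tail.
sd = statistics.pstdev(payrolls)
skewness = sum(((x - mean_payroll) / sd) ** 3 for x in payrolls) / len(payrolls)

print(f"mean={mean_payroll:.1f}, median={median_payroll:.1f}, skewness={skewness:.2f}")
```

On the real data, the analogous quantities would come from `base['total_salary'].mean()`, `.median()`, and `.skew()`. A mean well above the median together with positive skewness would support the reading of the histogram.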
ggplot(base, aes(x="total_salary")) +\
geom_histogram() +\
labs(title="Total Salary (millions) Expenditure Distribution", y="Number of Teams", x='Total Salary (millions)')
The next histogram shows an approximately normal distribution of regular season wins per team and year. Because the query inner-joins on the salaries table, it covers only the team-seasons that have salary records (918 observations in the models below), not the full 150+ years in the database.
ggplot(base, aes(x="wins")) +\
geom_histogram() +\
labs(title="Total Wins per Team & Year Distribution", y="Number of Teams", x='Total Wins in Regular Season')
The next boxplot compares total salary expenditure between two groups: teams that won the World Series and teams that did not. As one might have guessed, teams that won the World Series tend to spend more on player salaries. The mean expenditure was 49.4 million for teams that did not win and 71 million for teams that did.
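Group means like these can be computed directly rather than read off the plot, and reporting the median alongside the mean helps when the distribution is skewed. A minimal sketch with invented values (the `team_seasons` pairs are hypothetical, not the real data):

```python
import statistics
from collections import defaultdict

# Hypothetical (won_world_series, total_salary in $M) pairs, invented for illustration.
team_seasons = [
    (False, 30.0), (False, 45.0), (False, 52.0), (False, 120.0),
    (True, 60.0), (True, 95.0),
]

# Collect salaries per group, then summarize each group.
by_group = defaultdict(list)
for won, salary in team_seasons:
    by_group[won].append(salary)

summary = {
    won: {
        "n": len(vals),
        "mean": statistics.mean(vals),
        "median": statistics.median(vals),
    }
    for won, vals in by_group.items()
}
for won, stats in summary.items():
    print(won, stats)
```

With the `base` DataFrame, the same summary would be `base.groupby("won_world_series")["total_salary"].agg(["count", "mean", "median"])`.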
ggplot(base, aes(x="won_world_series", y='total_salary', fill='won_world_series')) +\
geom_boxplot() +\
scale_y_continuous(limits=(0,200)) +\
labs(title="Total Salary (millions) Among World Series Champs", y="Total Salary Expended (million)", x='Won World Series')
The next boxplot compares wins per million dollars of salary between the same two groups. Interestingly, teams that won the World Series had a lower mean wins per million, which indicates lower efficiency by this measure. One possible reading is quality over quantity: to win more, a team must spend more “pound for pound”. This should be taken with a grain of salt, though, since hundreds of outliers sit above the box for the teams that did not win.
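The outliers mentioned above can be counted instead of eyeballed. The sketch below applies the usual boxplot rule (points above Q3 + 1.5 × IQR) to invented values; `upper_outliers` and the sample list are hypothetical, and a plotting library may compute quartiles slightly differently, so counts can differ a little from what the figure shows:

```python
import statistics

def upper_outliers(values):
    """Return the values above Q3 + 1.5 * IQR (the usual boxplot upper fence)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    fence = q3 + 1.5 * (q3 - q1)
    return [v for v in values if v > fence]

# Hypothetical wins-per-million values, invented for illustration.
wins_per_million_demo = [0.8, 1.0, 1.1, 1.3, 1.5, 1.6, 1.8, 2.0, 6.5, 9.0]

flagged = upper_outliers(wins_per_million_demo)
print(f"{len(flagged)} of {len(wins_per_million_demo)} values lie above the upper fence: {flagged}")
```

Running this per group (winners and non-winners separately) would show how many team-seasons drive the high end of each box.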
ggplot(base, aes(x="won_world_series", y='wins_per_million', fill='won_world_series')) +\
geom_boxplot() +\
scale_y_continuous(limits=(0,10)) +\
labs(title="Wins Per Million Salary Spent Among World Series Champs", y="Wins Per Million", x='Won World Series')
A few key things to learn from the aforementioned visualizations:
Interesting relationships seem to develop around salary expenditure, wins per million, and wins among the two groups (won world series vs not). We will now get into whether these emerging relationships are predictive of success and evaluate if these relationships are statistically significant.
First, let’s address our core research question by predicting success metrics from salary spending, rather than the reverse.
# Model 1: Predicting Regular Season Wins from Salary
model_wins = smf.ols(
"wins ~ total_salary",
data=base
).fit(cov_type="HC3")
print(model_wins.summary())
OLS Regression Results
==============================================================================
Dep. Variable: wins R-squared: 0.065
Model: OLS Adj. R-squared: 0.064
Method: Least Squares F-statistic: 82.52
Date: Sun, 28 Sep 2025 Prob (F-statistic): 6.31e-19
Time: 16:30:47 Log-Likelihood: -3540.2
No. Observations: 918 AIC: 7084.
Df Residuals: 916 BIC: 7094.
Df Model: 1
Covariance Type: HC3
================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept 75.7688 0.639 118.658 0.000 74.517 77.020
total_salary 0.0695 0.008 9.084 0.000 0.055 0.085
==============================================================================
Omnibus: 7.485 Durbin-Watson: 1.847
Prob(Omnibus): 0.024 Jarque-Bera (JB): 6.557
Skew: -0.143 Prob(JB): 0.0377
Kurtosis: 2.701 Cond. No. 127.
==============================================================================
Notes:
[1] Standard Errors are heteroscedasticity robust (HC3)
Interpretation of Regular Season Wins Model:
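The regression results show that salary spending has a statistically significant relationship with regular season wins (p < 0.001). Here’s what the coefficients mean: the intercept (75.77) is the predicted number of wins for a team with zero payroll, which is an extrapolation; the total_salary coefficient (0.0695) means each additional $1 million in payroll is associated with about 0.07 more wins, or roughly 0.7 wins per $10 million; and the R-squared (0.065) means payroll explains about 6.5% of the variation in wins.
This reveals that while salary spending has a statistically significant positive association with wins, the relationship is much weaker than initially expected. The low R-squared indicates that salary is far from the dominant factor in team success.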
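To make the scale of the slope concrete, the arithmetic can be written out using the values printed in the OLS summary above. This is a sketch that copies the reported numbers by hand rather than reading them from the fitted model:

```python
# Values copied from the OLS summary above (total_salary is in $ millions).
intercept = 75.7688
slope = 0.0695
ci_low, ci_high = 0.055, 0.085  # reported 95% interval for the slope
r_squared = 0.065

# Rescale the slope and its interval to a $10M change in payroll.
wins_per_10m = slope * 10
wins_per_10m_range = (ci_low * 10, ci_high * 10)

# Share of the variation in wins that payroll leaves unexplained.
unexplained_share = 1 - r_squared

print(f"Extra wins per additional $10M: {wins_per_10m:.2f}")
print(f"95% interval: {wins_per_10m_range[0]:.2f} to {wins_per_10m_range[1]:.2f}")
print(f"Variation in wins not explained by payroll: {unexplained_share:.1%}")
```

With a live model object, the same numbers are available as `model_wins.params`, `model_wins.conf_int()`, and `model_wins.rsquared`.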
# Model 2: Predicting Postseason Wins from Salary
model_postseason = smf.ols(
"postseason_wins ~ total_salary",
data=base
).fit(cov_type="HC3")
print(model_postseason.summary())
OLS Regression Results
==============================================================================
Dep. Variable: postseason_wins R-squared: 0.036
Model: OLS Adj. R-squared: 0.035
Method: Least Squares F-statistic: 25.86
Date: Sun, 28 Sep 2025 Prob (F-statistic): 4.46e-07
Time: 16:30:47 Log-Likelihood: -2120.0
No. Observations: 918 AIC: 4244.
Df Residuals: 916 BIC: 4254.
Df Model: 1
Covariance Type: HC3
================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept 0.3153 0.120 2.628 0.009 0.080 0.550
total_salary 0.0109 0.002 5.085 0.000 0.007 0.015
==============================================================================
Omnibus: 521.433 Durbin-Watson: 2.003
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2911.353
Skew: 2.717 Prob(JB): 0.00
Kurtosis: 9.826 Cond. No. 127.
==============================================================================
Notes:
[1] Standard Errors are heteroscedasticity robust (HC3)
Interpretation of Postseason Performance:
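Interestingly, the postseason model shows salary spending is statistically significant (p < 0.001) but with an even weaker practical relationship: each additional $1 million in payroll is associated with about 0.011 more postseason wins (roughly one extra postseason win per $92 million), and the R-squared of 0.036 means payroll explains under 4% of the variation in postseason wins.
While statistically significant, this relationship is practically negligible. The extremely low R-squared indicates that salary spending explains almost none of the variation in postseason success.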
# Model 3: Predicting World Series Championships (Logistic Regression)
base['won_world_series'] = base['won_world_series'].astype(int)
model_ws = smf.logit(
"won_world_series ~ total_salary",
data=base
).fit()
print(model_ws.summary())
Optimization terminated successfully.
Current function value: 0.144936
Iterations 8
Logit Regression Results
==============================================================================
Dep. Variable: won_world_series No. Observations: 918
Model: Logit Df Residuals: 916
Method: MLE Df Model: 1
Date: Sun, 28 Sep 2025 Pseudo R-squ.: 0.01811
Time: 16:30:47 Log-Likelihood: -133.05
converged: True LL-Null: -135.51
Covariance Type: nonrobust LLR p-value: 0.02673
================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept -3.9167 0.328 -11.926 0.000 -4.560 -3.273
total_salary 0.0083 0.004 2.337 0.019 0.001 0.015
================================================================================
Interpretation of Championship Success:
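The logistic regression for World Series victories shows a statistically significant but practically weak relationship (p = 0.019): the total_salary coefficient of 0.0083 is on the log-odds scale, so each additional $1 million in payroll multiplies the odds of winning the World Series by about exp(0.0083) ≈ 1.008, an increase of roughly 0.8%.
While statistically detectable, this relationship has the weakest statistical support of the three models (p = 0.019, versus p < 0.001 for the other two). The extremely low pseudo R-squared (0.018) indicates that salary spending has minimal predictive power for championship success.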
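Logit coefficients are hard to read on the log-odds scale, so it can help to convert them into odds ratios and predicted probabilities. The sketch below copies the two coefficients from the logit summary above; the helper name `championship_probability` is an assumption introduced for illustration:

```python
import math

# Coefficients copied from the logit summary above (total_salary in $ millions).
b0 = -3.9167
b1 = 0.0083

def championship_probability(payroll_millions):
    """Predicted probability of winning the World Series under the fitted logit."""
    log_odds = b0 + b1 * payroll_millions
    return 1 / (1 + math.exp(-log_odds))

# Multiplicative change in the odds for a $1M and a $10M payroll increase.
odds_ratio_per_million = math.exp(b1)
odds_ratio_per_10m = math.exp(b1 * 10)

p_50 = championship_probability(50)
p_200 = championship_probability(200)

print(f"Odds ratio per $1M:  {odds_ratio_per_million:.4f}")
print(f"Odds ratio per $10M: {odds_ratio_per_10m:.4f}")
print(f"P(win WS) at $50M:  {p_50:.3f}")
print(f"P(win WS) at $200M: {p_200:.3f}")
```

On these coefficients, the predicted probability rises from roughly 3% at $50M to roughly 9.5% at $200M. That is a noticeable relative increase, but the large majority of even the highest-spending team-seasons are still predicted not to win.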
# Calculate predictions for different payroll levels
low_salary_team = 50 # $50M
high_salary_team = 200 # $200M
# Look up the coefficients by name rather than by position
intercept = model_wins.params["Intercept"]
slope = model_wins.params["total_salary"]
predicted_wins_low = intercept + slope * low_salary_team
predicted_wins_high = intercept + slope * high_salary_team
print(f"Low salary team (${low_salary_team}M): {predicted_wins_low:.1f} predicted wins")
print(f"High salary team (${high_salary_team}M): {predicted_wins_high:.1f} predicted wins")
print(f"Difference: {predicted_wins_high - predicted_wins_low:.1f} more wins")
Low salary team ($50M): 79.2 predicted wins
High salary team ($200M): 89.7 predicted wins
Difference: 10.4 more wins
Real-World Impact:
A team spending $200M versus $50M is predicted to win approximately 10.4 more games in the regular season. In a 162-game season, this represents an improvement of about 6.4 percentage points in winning percentage, which could mean the difference between making the playoffs and missing out. However, this comes at a cost of $150M in additional payroll, roughly $14.4 million per additional win.
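The cost-per-win figure follows directly from the regression slope. A short worked version of the arithmetic, with the slope copied by hand from the regular season OLS summary:

```python
# Slope copied from the regular season OLS summary (wins per $1M of payroll).
slope = 0.0695
games_per_season = 162

extra_payroll = 200 - 50                    # additional payroll, in $ millions
extra_wins = slope * extra_payroll          # predicted additional wins
cost_per_win = extra_payroll / extra_wins   # $ millions per win, equal to 1 / slope
win_pct_gain_points = 100 * extra_wins / games_per_season

print(f"Extra wins: {extra_wins:.1f}")
print(f"Cost per extra win: ${cost_per_win:.1f}M")
print(f"Winning percentage gain: {win_pct_gain_points:.1f} percentage points")
```

Because the model is linear, the cost per win is simply the reciprocal of the slope and comes out the same for any pair of payroll levels. That constancy is a property of the model specification, not evidence that returns to spending are actually constant.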
Does salary expenditure relate to postseason success?
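The evidence shows a nuanced relationship: payroll is positively and significantly associated with regular season wins (0.0695 wins per $1 million, R-squared 0.065), with postseason wins (0.0109 wins per $1 million, R-squared 0.036), and with winning the World Series (p = 0.019, pseudo R-squared 0.018), but it explains only a small share of the variation in each.
Conclusion: Money helps teams reach the postseason through regular season success, but once in the playoffs, it appears to provide diminishing returns. This suggests that while salary can buy talent, playoff success may depend more on factors like team chemistry, coaching decisions, and situational performance that can’t be easily purchased.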
Important Considerations:
These limitations suggest our findings should be interpreted as associations rather than definitive causal relationships.